Skip to content

Conversation

@hgreebe
Copy link
Contributor

@hgreebe hgreebe commented Dec 17, 2025

Description of changes

  • Changes how compute nodes handle in place updates so that we no longer rely on cfn-hup running on compute nodes.
  • Replaces cfn-hup with a systemd timer that periodically checks a file in shared storage and runs an update if that file has been modified with the new cluster config version that has not yet been applied to the compute nodes. This file is updated by the head node when a cluster update occurs in order to signal to the compute nodes to update.
  • The check-update.service has a 30 second timeout so that it does not run indefinately if something hangs.
  • Revert changes for in_place_update_on_fleet_enabled from: 6eda378#diff-6d6c58cce2dd575c0638ee245d9647b0dfa3cbdef86a136bd816d00538529fb4

Tests

  • Created unit tests to cover the systemd timer and service as well as new update logic
  • Ran all the update integ tests

References

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hgreebe hgreebe requested review from a team as code owners December 17, 2025 13:28
@hgreebe hgreebe force-pushed the develop branch 2 times, most recently from 5ce2f72 to 96d1cbd Compare December 17, 2025 15:38
@@ -0,0 +1,11 @@
[Unit]
Description=Check file modification time every minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description must be more specific: it must be more clear that this is a timer configured by pcluster (add this info in the description and in the file name) and the goal of the timer (checks file modifications, which files, why?)

Description=Check file modification time every minute

[Timer]
AccuracySec=1s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The accuracy has an impact on CPU wake-ups.
The finer the accuracy, the higher the chances to generate load on the CPU.
Since we are using a 60sec activation, what about using a 20s accuracy?

See https://www.freedesktop.org/software/systemd/man/latest/systemd.timer.html

# Cookbook:: aws-parallelcluster-slurm
# Recipe:: config_compute
#
# Copyright:: 2013-2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and in other files: fix copyright year with Copyright:: 2025 Amazon.com


#
# Cookbook:: aws-parallelcluster-slurm
# Recipe:: config_compute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix recipe name

# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
# limitations under the License.

template '/etc/systemd/system/check-update.service' do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, above and below: the name of the service and the corresponding timer must be more talkative, expressing that they are pcluster services related to the cluster updates

------

**CHANGES**
- Replace cfn-hup in compute nodes with systemd timers to support in place updates.
Copy link
Contributor

@gmarciani gmarciani Dec 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must surface the value of this change for the user, which is providing better performance.
Also we should surface that the new mechanism relies on shared storage sync between head node and compute fleet.

chef_sleep '15'

wait_cluster_ready if cluster_readiness_check_on_update_enabled?
wait_cluster_ready
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removing this?
We should keep it as it may be useful to keep a way for the user to skip the readiness check if necessary.

node['cluster']['node_type'] == 'HeadNode' || node['cluster']['in_place_update_on_fleet_enabled'] == 'true'
end

def cluster_readiness_check_on_update_enabled?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as https://github.com/aws/aws-parallelcluster-cookbook/pull/3070/changes#r2641117741

I think it is useful to keep a way to skip cluster readiness check through chef attributes.
The current chef attribute name must be changed as in_place_update_on_fleet_enabled will not be valid anymore. We could expose something like cluster/update/cluster_readiness_check_enabled

default['cluster']['nfs']['hard_mount_options'] = 'hard,_netdev,noatime'

# Cluster Updates
default['cluster']['in_place_update_on_fleet_enabled'] = 'true'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +13 to +14
default['cluster']['shared_update_path'] = "#{node['cluster']['shared_dir']}/check_update"
default['cluster']['update_checkpoint'] = "#{node['cluster']['scripts_dir']}/update_checkpoint"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to use more talkative names, such as

default['cluster']['update']['trigger_file'] = "#{node['cluster']['shared_dir']}/update_trigger"
default['cluster']['update']['checkpoint_file'] = "#{node['cluster']['scripts_dir']}/update_checkpoint"

With such names for the attributes we better convey that:

  1. they are files (when an attribute contains a directory path, it has dir as suffix)
  2. one is used as a trigger

fi'

[Install]
WantedBy=multi-user.target No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This service is triggered by a timer. What is the point of having WantedBy=multi-user.target?
I think it could be removed as it is actually unused.

[Unit]
Description=Check for recent file modifications

[Service]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are missing logs for the logic executed by this service.
I suggest to set the property StandardOutput to log to a local log file. We must then push such file to cloudwatch.

Also, the logic should log more information:

  1. when it starts
  2. if the file exists
  3. what is the content of the checkpoint file
  4. what is the content of the trigger file
  5. the decision taken
  6. when it completes

All log lines must be timestamped with millis resolution.


[Service]
Type=oneshot
TimeoutStartSec=30
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this timeout is here to prevent the service being stuck in case of file system unresponsiveness.
This is a good strategy as it is important to put a timeout when dealing with remote resources.
However, I think TimeoutStartSec may not be the right parameter to achieve this.

According to the official documentation for TimeoutStartSec:

If a daemon service does not signal start-up completion within the configured time, the service will be considered failed and will be shut down again.

So the timeout covers the whole start logic. The start logic here includes the execution of cfn-hup-update-action.sh, which can legitimately last longer than 30 seconds. So with the current approach we have the risk of interrupting legitimate updates.

If you agree with this risk, I suggest to configure the timeout logic differently:

  1. define a timeout of 20 seconds to read the shared files (20s is enough to capture file system failures with a simple read operation on a single file)
  2. define a timeout for the update action, which must be set through the node_bootstrap_timeout parameter, as it currently is. See https://github.com/gmarciani/aws-parallelcluster-cookbook/blob/wip/mgiacomo/3141/fix-build-ubhu-1218-1/cookbooks/aws-parallelcluster-environment/templates/cfn_hup_configuration/cfn-hook-update.conf.erb#L9-L9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants